Context: EHR data is becoming a key source of real-world evidence (RWE) for the pharmaceutical industry and regulators to make decisions on clinical trials. You are a data scientist for an exciting unicorn healthcare startup that has created a groundbreaking diabetes drug that is ready for clinical trial testing. It is a very unique and sensitive drug that requires administering the drug over at least 5-7 days of time in the hospital with frequent monitoring/testing and patient medication adherence training with a mobile application. You have been provided a patient dataset from a client partner and are tasked with building a predictive model that can identify which type of patients the company should focus their efforts testing this drug on. Target patients are people that are likely to be in the hospital for this duration of time and will not incur significant additional costs for administering this drug to the patient and monitoring.
In order to achieve your goal you must build a regression model that can predict the estimated hospitalization time for a patient and use this to select/filter patients for your study.
Expected Hospitalization Time Regression Model: Utilizing a synthetic dataset(denormalized at the line level augmentation) built off of the UCI Diabetes readmission dataset, students will build a regression model that predicts the expected days of hospitalization time and then convert this to a binary prediction of whether to include or exclude that patient from the clinical trial.
This project will demonstrate the importance of building the right data representation at the encounter level, with appropriate filtering and preprocessing/feature engineering of key medical code sets. This project will also require students to analyze and interpret their model for biases across key demographic groups.
Please see the project rubric online for more details on the areas your project will be evaluated.
Due to healthcare PHI regulations (HIPAA, HITECH), there are limited number of publicly available datasets and some datasets require training and approval. So, for the purpose of this exercise, we are using a dataset from UC Irvine(https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008) that has been modified for this course. Please note that it is limited in its representation of some key features such as diagnosis codes which are usually an unordered list in 835s/837s (the HL7 standard interchange formats used for claims and remits).
Data Schema The dataset reference information can be https://github.com/udacity/nd320-c1-emr-data-starter/blob/master/project/data_schema_references/ . There are two CSVs that provide more details on the fields and some of the mapped values.
When submitting this project, make sure to run all the cells before saving the notebook. Save the notebook file as "student_project_submission.ipynb" and save another copy as an HTML file by clicking "File" -> "Download as.."->"html". Include the "utils.py" and "student_utils.py" files in your submission. The student_utils.py should be where you put most of your code that you write and the summary and text explanations should be written inline in the notebook. Once you download these files, compress them into one zip file for submission.
For step by step instructions on creating your environment, please go to https://github.com/udacity/nd320-c1-emr-data-starter/blob/master/project/README.md.
By the end of the project, you will be able to
# install packages
import sys
import pkg_resources
from IPython.display import display, Markdown, Latex, clear_output, display_html
# check if plotly exists in the library
# if not, install the package & restart the notebook
if len([True for d in pkg_resources.working_set if d.project_name in ['plotly']]) == 0:
# install plotly
!{sys.executable} -m pip install plotly # you may want to restart your kernel if it can't find the plotly package
# after installing it
# restart the notebook
clear_output()
display(Markdown('<br/><span style="color: #dd0000; font-size: 34px; font-weight: 700;">Re-run your cells!<br/> The notebook has been restarted!</span>'))
display_html("<script>Jupyter.notebook.kernel.restart()</script>",raw=True)
# from __future__ import absolute_import, division, print_function, unicode_literals
import os
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_probability as tfp
import matplotlib.pyplot as plt
import pandas as pd
import aequitas as ae
from IPython.display import display, Markdown, Latex, clear_output
import plotly.graph_objects as go
from plotly.subplots import make_subplots
pd.options.plotting.backend = "plotly"
# Put all of the helper functions in utils
from utils import build_vocab_files, show_group_stats_viz, aggregate_dataset, preprocess_df, df_to_dataset, posterior_mean_field, prior_trainable
pd.set_option('display.max_columns', 500)
# this allows you to make changes and save in student_utils.py and the file is reloaded every time you run a code block
%load_ext autoreload
%autoreload
#OPEN ISSUE ON MAC OSX for TF model training
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'
Load the dataset and view a sample of the dataset along with reviewing the schema reference files to gain a deeper understanding of the dataset. The dataset is located at the following path https://github.com/udacity/nd320-c1-emr-data-starter/blob/master/project/starter_code/data/final_project_dataset.csv. Also, review the information found in the data schema https://github.com/udacity/nd320-c1-emr-data-starter/blob/master/project/data_schema_references/
dataset_path = "./data/final_project_dataset.csv"
#
# Add ?, '?|?', 'None' as not a number
#
df = pd.read_csv(dataset_path, na_values=[ '', ' ', '?', '?|?','None', '-NaN', '-nan', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'])
df.head()
Question 1: Based off of analysis of the data, what level is this dataset? Is it at the line or encounter level? Are there any key fields besides the encounter_id and patient_nbr fields that we should use to aggregate on? Knowing this information will help inform us what level of aggregation is necessary for future steps and is a step that is often overlooked.
txt = f'<b>Amount of Rows: </b>{df.shape[0]} \n'
txt += f'<b>Amount of Patients: </b>{df["patient_nbr"].nunique()} \n'
txt += f'<b>Amount of Encounter IDs: </b>{df["encounter_id"].nunique()} \n'
txt += '\n<b>Line or Encounter: </b>'
if ('encounter_id' in list(df)) and (len(df) > df['encounter_id'].nunique()):
txt += "Dataset could be at the **Line Level**"
elif ('encounter_id' in list(df)) and (len(df) == df['encounter_id'].nunique()):
txt += "Dataset could be at the **Encounter Level**"
else:
txt += "Dataset could be at the **Longitudinal Level**"
display(Markdown(txt))
Answer: Dataset is at the Line Level, because there are more rows than Encounter Ids.
#
# According to the documentation, i've put the features into 2 sets
# URL: https://www.hindawi.com/journals/bmri/2014/781670/tab1/
#
#
# patient and encounter id
# (statistically not interesting)
ids = ['encounter_id','patient_nbr']
#
# categorical (nominal) features
#
features_categorical = ['race',
'gender',
'age',
'weight', # documentation says numeric, but here it is categorical
'admission_type_id',
'discharge_disposition_id',
'admission_source_id',
'payer_code',
'medical_specialty',
'primary_diagnosis_code',
'other_diagnosis_codes',
'ndc_code',
'max_glu_serum',
'A1Cresult',
'change',
'readmitted']
#
# numerical features
#
features_numerical = ['time_in_hospital', # Integer number of days between admission and discharge
'number_outpatient',
'number_inpatient',
'number_emergency',
'num_lab_procedures',
'number_diagnoses',
'num_medications',
'num_procedures']
Question 2: Utilizing the library of your choice (recommend Pandas and Seaborn or matplotlib though), perform exploratory data analysis on the dataset. In particular be sure to address the following questions:
- a. Field(s) with high amount of missing/zero values
- b. Based off the frequency histogram for each numerical field,
which numerical field(s) has/have a Gaussian(normal) distribution shape?
- c. Which field(s) have high cardinality and why (HINT: ndc_code is one feature)
- d. Please describe the demographic distributions in the dataset for the age and gender fields.
OPTIONAL: Use the Tensorflow Data Validation and Analysis library to complete.
df_columns_info = [{'Feature': col,
'Frquency': df.shape[0],
'# Freq_Null': df[col].isnull().sum(),
'% Freq_Null': np.around(df[col].isnull().sum()/df.shape[0]*100, 1),
'# Freq_Zeros': '-',
'% Freq_Zeros': '-',
'Type': 'Categorical'} for col in list(features_categorical)]
df_columns_info += [{'Feature': col,
'Frquency': df.shape[0],
'# Freq_Null': df[col].isnull().sum(),
'% Freq_Null': np.around(df[col].isnull().sum()/df.shape[0]*100, 1),
'# Freq_Zeros': (df[col]==0).sum(),
'% Freq_Zeros': np.around((df[col] == 0).sum()/df.shape[0]*100, 1),
'Type': 'Numerical'} for col in list(features_numerical)]
# show missings and zeros
txt = '## Missings and zeros '
display(Markdown(txt))
pd.DataFrame(df_columns_info)
fig = make_subplots(rows=int(np.ceil(len(features_numerical)/2)), cols=2)
for e, feature in enumerate(features_numerical, start=2):
# create
trace = go.Histogram(x=df[feature], name=feature)
# store the plot
fig.append_trace(trace, e//2, e%2 + 1)
# update figure
fig.update_layout( title="Histograms",
legend_title="Features",
font=dict(
size=16,
)
)
display(Markdown("## Numerical Features: Histgrams"))
fig
Answer: There are 3 clear Gaussian normal distribution
- time_in_hospital
- num_lab_procedures
- num_medications
for feature in features_categorical:
# create
vc = df[feature].fillna('None').value_counts().reset_index()
vc.columns = ['Value', '# Amount']
vc['% Perc.'] = np.around(vc['# Amount']/vc['# Amount'].sum()*100, 1)
# print result
text = f'### Feature: {feature} \n'
text += f'<b>Nr of categories</b>: {vc.shape[0]} \n'
text += f'<b>Missing values</b>: {df[feature].isnull().sum() > 0} \n'
text += f'<b>High Cardinality</b>: {"Yes" if vc.shape[0] > 100 else "No"}'
display(Markdown(text))
display(vc)
print('')
Answer: from the above tables, if I set my threshold for cardinality to 100, then I have 3 high cardinally features:
- other_diagnosis_codes
- primary_diagnosis_code
- ndc_code
text = '<b>Absolute:</b> Distribution of Age by gender.'
df_dist = pd.crosstab(df['gender'], df['age'])
df_dist['Total'] = df_dist.sum(axis=1)
df_dist['% Total'] = np.around(df_dist['Total']/df_dist['Total'].sum()*100, 1)
df_dist = df_dist.T
df_dist['Total'] = df_dist.sum(axis=1)
df_dist['% Total'] = np.around(df_dist['Total']/df_dist.loc['Total', 'Total']*100, 1)
df_dist = df_dist.T
df_dist.loc['% Total', '% Total'] = 100
df_dist.iloc[:-1,:-1] = df_dist.iloc[:-1,:-1].astype(np.int32).astype(str)
print('')
display(Markdown(text))
display(df_dist)
df_dist.astype(np.float32).T.iloc[:-1,:-2][['Male','Female','Unknown/Invalid']].plot(kind='bar', barmode="group", title="Distribution: Age - Gender (Absolute)")
text = '<b>Relative %:</b> Show the distribution of Age by gender. \n'
text += 'How would the distribution look like, if an equal number of men and women were represented.'
df_dist = pd.crosstab(df['gender'], df['age'])
df_dist = np.around(df_dist / df_dist.sum(axis=1)[:,None] * 100,3)
df_dist['Total'] = df_dist.sum(axis=1)
print('')
display(Markdown(text))
display(df_dist)
df_dist.T[['Male', 'Female']].plot(kind='bar', barmode="group", title="Distribution: Gender - Age (Relative)")
text = f"**Answer**: From the above information & interactive plots :), we can observe the following things: \n"
text += f"- There are more Female then Male patients in the dataset: 53.1% vs 46.9% \n"
text += f"- The most common age group is: [70-80], they represent 25.7% of the data \n"
text += f"- 92% of the data has a age value between [40 and 90[ \n"
text += f"- Via the relative table and figure, we can notice that up to the age group of 70, we note that men are more often represented than women \n"
text += f" (Probably because men die earlier?) \n"
display(Markdown(text))
Question 3: NDC codes are a common format to represent the wide variety of drugs that are prescribed for patient care in the United States. The challenge is that there are many codes that map to the same or similar drug. You are provided with the ndc drug lookup file https://github.com/udacity/nd320-c1-emr-data-starter/blob/master/project/data_schema_references/ndc_lookup_table.csv derived from the National Drug Codes List site(https://ndclist.com/). Please use this file to come up with a way to reduce the dimensionality of this field and create a new field in the dataset called "generic_drug_name" in the output dataframe.
#NDC code lookup file
ndc_code_path = "./medication_lookup_tables/final_ndc_lookup_table"
ndc_code_df = pd.read_csv(ndc_code_path)
ndc_code_df.head(3)
from student_utils import reduce_dimension_ndc
reduce_dim_df = reduce_dimension_ndc(df, ndc_code_df)
# Number of unique values should be less for the new output field
assert df['ndc_code'].nunique() > reduce_dim_df['generic_drug_name'].nunique()
print(f"Tests passed!! {df['ndc_code'].nunique()} - {reduce_dim_df['generic_drug_name'].nunique()}")
Question 4: In order to simplify the aggregation of data for the model, we will only select the first encounter for each patient in the dataset. This is to reduce the risk of data leakage of future patient encounters and to reduce complexity of the data transformation and modeling steps. We will assume that sorting in numerical order on the encounter_id provides the time horizon for determining which encounters come before and after another.
from student_utils import select_first_encounter
first_encounter_df = select_first_encounter(reduce_dim_df)
# unique patients in transformed dataset
unique_patients = first_encounter_df['patient_nbr'].nunique()
print("Number of unique patients:{}".format(unique_patients))
# unique encounters in transformed dataset
unique_encounters = first_encounter_df['encounter_id'].nunique()
print("Number of unique encounters:{}".format(unique_encounters))
original_unique_patient_number = reduce_dim_df['patient_nbr'].nunique()
# number of unique patients should be equal to the number of unique encounters and patients in the final dataset
assert original_unique_patient_number == unique_patients
assert original_unique_patient_number == unique_encounters
print("Tests passed!!")
In order to provide a broad scope of the steps and to prevent students from getting stuck with data transformations, we have selected the aggregation columns and provided a function to build the dataset at the appropriate level. The 'aggregate_dataset" function that you can find in the 'utils.py' file can take the preceding dataframe with the 'generic_drug_name' field and transform the data appropriately for the project.
To make it simpler for students, we are creating dummy columns for each unique generic drug name and adding those are input features to the model. There are other options for data representation but this is out of scope for the time constraints of the course.
exclusion_list = ['generic_drug_name']
grouping_field_list = [c for c in first_encounter_df.columns if c not in exclusion_list]
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# !!! FILL THE NA's with MISSING, or this method doesn't work !!!
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
agg_drug_df, ndc_col_list = aggregate_dataset(first_encounter_df.fillna('Missing'), grouping_field_list, 'generic_drug_name')
assert len(agg_drug_df) == agg_drug_df['patient_nbr'].nunique() == agg_drug_df['encounter_id'].nunique()
ndc_col_list
Question 5: After you have aggregated the dataset to the right level, we can do feature selection (we will include the ndc_col_list, dummy column features too). In the block below, please select the categorical and numerical features that you will use for the model, so that we can create a dataset subset.
For the payer_code and weight fields, please provide whether you think we should include/exclude the field in our model and give a justification/rationale for this based off of the statistics of the data. Feel free to use visualizations or summary statistics to support your choice.
Answer: We wont include payer_code nor weights, because they contain to many missing values.
features_categorical
features_numerical
'''
Please update the list to include the features you think are appropriate for the model
and the field that we will be using to train the model. There are three required demographic features for the model
and I have inserted a list with them already in the categorical list.
These will be required for later steps when analyzing data splits and model biases.
'''
required_demo_col_list = ['race', 'gender', 'age']
student_categorical_col_list = ['primary_diagnosis_code'] + required_demo_col_list + ndc_col_list
student_numerical_col_list = ['num_lab_procedures', 'number_diagnoses', 'num_medications', 'num_procedures']
PREDICTOR_FIELD = 'time_in_hospital'
def select_model_features(df, categorical_col_list, numerical_col_list, PREDICTOR_FIELD, grouping_key='patient_nbr'):
selected_col_list = [grouping_key] + [PREDICTOR_FIELD] + categorical_col_list + numerical_col_list
return agg_drug_df[selected_col_list]
selected_features_df = select_model_features(agg_drug_df, student_categorical_col_list,
student_numerical_col_list,
PREDICTOR_FIELD)
We will cast and impute the dataset before splitting so that we do not have to repeat these steps across the splits in the next step. For imputing, there can be deeper analysis into which features to impute and how to impute but for the sake of time, we are taking a general strategy of imputing zero for only numerical features.
OPTIONAL: What are some potential issues with this approach? Can you recommend a better way and also implement it?
Answer: if you fill in the missing values with zeros, like "number of previous visits" for example, it can have 2 meanings. 1) the patient does not have a "previous visit value" or 2) we don't know. So depending on the model, e.g. in a Random Forest model, here I would replalce the missing values by -1 (because then the model will know, aha a missing value). However, with models such as a neural network, I would more likely replace this value with the median by age group and gender (if the number of missing values for that particular feature is limited)
processed_df = preprocess_df(selected_features_df,
student_categorical_col_list,
student_numerical_col_list,
PREDICTOR_FIELD,
categorical_impute_value='nan',
numerical_impute_value=0)
Question 6: In order to prepare the data for being trained and evaluated by a deep learning model, we will split the dataset into three partitions, with the validation partition used for optimizing the model hyperparameters during training. One of the key parts is that we need to be sure that the data does not accidently leak across partitions.
Please complete the function below to split the input dataset into three partitions(train, validation, test) with the following requirements.
from student_utils import patient_dataset_splitter
d_train, d_val, d_test = patient_dataset_splitter(processed_df, 'patient_nbr')
assert len(d_train) + len(d_val) + len(d_test) == len(processed_df)
print("Test passed for number of total rows equal!")
assert (d_train['patient_nbr'].nunique() + d_val['patient_nbr'].nunique() + d_test['patient_nbr'].nunique()) == agg_drug_df['patient_nbr'].nunique()
print("Test passed for number of unique patients being equal!")
After the split, we should check to see the distribution of key features/groups and make sure that there is representative samples across the partitions. The show_group_stats_viz function in the utils.py file can be used to group and visualize different groups and dataframe partitions.
Below you can see the distributution of the label across your splits. Are the histogram distribution shapes similar across partitions?
pd.options.plotting.backend = "matplotlib"
show_group_stats_viz(processed_df, PREDICTOR_FIELD)
show_group_stats_viz(d_train, PREDICTOR_FIELD)
show_group_stats_viz(d_test, PREDICTOR_FIELD)
We should check that our partitions/splits of the dataset are similar in terms of their demographic profiles. Below you can see how we might visualize and analyze the full dataset vs. the partitions.
# Full dataset before splitting
patient_demo_features = ['race', 'gender', 'age', 'patient_nbr']
patient_group_analysis_df = processed_df[patient_demo_features].groupby('patient_nbr').head(1).reset_index(drop=True)
show_group_stats_viz(patient_group_analysis_df, 'gender')
# Training partition
show_group_stats_viz(d_train, 'gender')
# Test partition
show_group_stats_viz(d_test, 'gender')
We have provided you the function to convert the Pandas dataframe to TF tensors using the TF Dataset API. Please note that this is not a scalable method and for larger datasets, the 'make_csv_dataset' method is recommended -https://www.tensorflow.org/api_docs/python/tf/data/experimental/make_csv_dataset.
# Convert dataset from Pandas dataframes to TF dataset
batch_size = 128
diabetes_train_ds = df_to_dataset(d_train, PREDICTOR_FIELD, batch_size=batch_size)
diabetes_val_ds = df_to_dataset(d_val, PREDICTOR_FIELD, batch_size=batch_size)
diabetes_test_ds = df_to_dataset(d_test, PREDICTOR_FIELD, batch_size=batch_size)
# We use this sample of the dataset to show transformations later
diabetes_batch = next(iter(diabetes_train_ds))[0]
def demo(feature_column, example_batch):
feature_layer = layers.DenseFeatures(feature_column)
print(feature_layer(example_batch))
Before we can create the TF categorical features, we must first create the vocab files with the unique values for a given field that are from the training dataset. Below we have provided a function that you can use that only requires providing the pandas train dataset partition and the list of the categorical columns in a list format. The output variable 'vocab_file_list' will be a list of the file paths that can be used in the next step for creating the categorical features.
#
# Input: d_train - The training data (DataFrame)
# student_categorical_col_list - Categorical features of our dataset
#
# What we want to do?: Writing txt files to a folder, for each Feature (column) in our student_categorical_col_list.
# This txt file contains the unique values from the training set & starts with a default value
#
#
# Code to write a single txt file
#
# Input: vocab_list - a list of values which belongs to a feature (column)
# field_name - Name of the feature
# default_value - we only use the training set, thus it can be that a values doesn't exist yet
# in the validation set, or when we make real predictions
# therefore we'll use a default value
# vocab_dir - Location of the file
#
#
# def write_vocabulary_file(vocab_list, field_name, default_value, vocab_dir='./diabetes_vocab/'):
#
# output_file_path = os.path.join(vocab_dir, str(field_name) + "_vocab.txt") # 1) define Location of the file
#
# # put default value in first row as TF requires # 2) insert the default value(s)
# vocab_list = np.insert(vocab_list, 0, default_value, axis=0)
#
# df = pd.DataFrame(vocab_list).to_csv(output_file_path, index=None, header=None) # 3) write the file, via pandas
# return output_file_path
#
# Code to create the txt files,
#
# def build_vocab_files(df, categorical_column_list, default_value='00'):
# vocab_files_list = []
# for c in categorical_column_list: # 1) Loop over the categorical features
# v_file = write_vocabulary_file(df[c].unique(), c, default_value) # 2) get the unique values + default
# vocab_files_list.append(v_file) # 3) save the path
# return vocab_files_list # 4) return all the paths
#
vocab_file_list = build_vocab_files(d_train, student_categorical_col_list)
vocab_file_list[:10]
Question 7: Using the vocab file list from above that was derived fromt the features you selected earlier, please create categorical features with the Tensorflow Feature Column API, https://www.tensorflow.org/api_docs/python/tf/feature_column. Below is a function to help guide you.
# def create_tf_categorical_feature_cols(categorical_col_list,
# vocab_dir='./diabetes_vocab/'):
# '''
# categorical_col_list: list, categorical field list that will be transformed with TF feature column
# vocab_dir: string, the path where the vocabulary text files are located
# return:
# output_tf_list: list of TF feature columns
# '''
# # key
# ## A unique string identifying the input feature.
# ## It is used as the column name and the dictionary key for feature parsing configs,
# ## feature Tensor objects, and feature columns.
# # vocabulary_file
# ## The vocabulary file name.
# # default_value
# ## The integer ID value to return for out-of-vocabulary feature values,
# ## defaults to -1. This can not be specified with a positive num_oov_buckets.
# # num_oov_buckets
# ## Non-negative integer, the number of out-of-vocabulary buckets.
# ## All out-of-vocabulary inputs will be assigned IDs in the range
# ## [vocabulary_size, vocabulary_size+num_oov_buckets) based on a hash of the input value.
# ## A positive num_oov_buckets can not be specified with default_value.
# output_tf_list = []
# for c in categorical_col_list:
# vocab_file_path = os.path.join(vocab_dir, c + "_vocab.txt")
# '''
# Which TF function allows you to read from a text file and create a categorical feature
# You can use a pattern like this below...
# tf_categorical_feature_column = tf.feature_column.......
# '''
# tf_categorical_feature_column = tf.feature_column\
# .categorical_column_with_vocabulary_file(key=c,
# vocabulary_file = vocab_file_path,
# default_value=0)
# output_tf_list.append( tf.feature_column.indicator_column(tf_categorical_feature_column) )
# return output_tf_list
########
# Info #
########
# Previously: we made some txt files of all the unique values of the features
#
# Goal: Convert the a Feature (column) into a one hot encoded matrix
#
# How? : tf.feature_column.categorical_column_with_vocabulary_file - will read a txt file
# | and is basically nothing more then a list of words | values
# \_________
# # create a kind of lookup table | dictionary \
# layers.DenseFeatures(tf.feature_column.indicator_column( ... ))({'primary_diagnosis_code': ['410','650','2020021']})
#
# --> thiw will return:
# array([[0., 0., 0., ..., 0., 0., 0.],
# [0., 0., 0., ..., 0., 0., 0.],
# [1., 0., 0., ..., 0., 0., 0.]], dtype=float32)>
#
# Thus if found 410 and 650 in the txt file but not 2020021
# And returned the one hot encoded vectors back
from student_utils import create_tf_categorical_feature_cols
tf_cat_col_list = create_tf_categorical_feature_cols(student_categorical_col_list)
test_cat_var1 = tf_cat_col_list[0]
print("Example categorical field:\n{}".format(test_cat_var1))
demo(test_cat_var1, diabetes_batch)
Question 8: Using the TF Feature Column API(https://www.tensorflow.org/api_docs/python/tf/feature_column/), please create normalized Tensorflow numeric features for the model. Try to use the z-score normalizer function below to help as well as the 'calculate_stats_from_train_data' function.
from student_utils import create_tf_numeric_feature
For simplicity the create_tf_numerical_feature_cols function below uses the same normalizer function across all features(z-score normalization) but if you have time feel free to analyze and adapt the normalizer based off the statistical distributions. You may find this as a good resource in determining which transformation fits best for the data https://developers.google.com/machine-learning/data-prep/transform/normalization.
def calculate_stats_from_train_data(df, col):
mean = df[col].describe()['mean']
std = df[col].describe()['std']
return mean, std
def create_tf_numerical_feature_cols(numerical_col_list, train_df):
tf_numeric_col_list = []
for c in numerical_col_list:
mean, std = calculate_stats_from_train_data(train_df, c)
tf_numeric_feature = create_tf_numeric_feature(c, mean, std)
tf_numeric_col_list.append(tf_numeric_feature)
return tf_numeric_col_list
tf_cont_col_list = create_tf_numerical_feature_cols(student_numerical_col_list, d_train)
test_cont_var1 = tf_cont_col_list[0]
print("Example continuous field:\n{}\n".format(test_cont_var1))
demo(test_cont_var1, diabetes_batch)
Now that we have prepared categorical and numerical features using Tensorflow's Feature Column API, we can combine them into a dense vector representation for the model. Below we will create this new input layer, which we will call 'claim_feature_layer'.
claim_feature_columns = tf_cat_col_list + tf_cont_col_list
claim_feature_layer = tf.keras.layers.DenseFeatures(claim_feature_columns)
Below we have provided some boilerplate code for building a model that connects the Sequential API, DenseFeatures, and Tensorflow Probability layers into a deep learning model. There are many opportunities to further optimize and explore different architectures through benchmarking and testing approaches in various research papers, loss and evaluation metrics, learning curves, hyperparameter tuning, TF probability layers, etc. Feel free to modify and explore as you wish.
OPTIONAL: Come up with a more optimal neural network architecture and hyperparameters. Share the process in discovering the architecture and hyperparameters.
def build_sequential_model(feature_layer):
model = tf.keras.Sequential([
feature_layer,
tf.keras.layers.Dropout(.1),
tf.keras.layers.Dense(150, activation='relu'),
tf.keras.layers.Dropout(.4),
tf.keras.layers.Dense(75, activation='relu'),
tfp.layers.DenseVariational(1+1, posterior_mean_field, prior_trainable),
tfp.layers.DistributionLambda(
lambda t:tfp.distributions.Normal(loc=t[..., :1],
scale=1e-3 + tf.math.softplus(0.01 * t[...,1:])
)
),
])
return model
def build_diabetes_model(train_ds, val_ds, feature_layer, epochs=5, loss_metric='mse'):
model = build_sequential_model(feature_layer)
model.compile(optimizer='adam', loss=loss_metric, metrics=[loss_metric])
early_stop = tf.keras.callbacks.EarlyStopping(monitor=loss_metric, patience=3)
history = model.fit(train_ds, validation_data=val_ds,
callbacks=[early_stop],
epochs=epochs)
return model, history
diabetes_model, history = build_diabetes_model(diabetes_train_ds, diabetes_val_ds, claim_feature_layer, epochs=25)
Question 9: Now that we have trained a model with TF Probability layers, we can extract the mean and standard deviation for each prediction. Please fill in the answer for the m and s variables below. The code for getting the predictions is provided for you below.
feature_list = student_categorical_col_list + student_numerical_col_list
diabetes_x_tst = dict(d_test[feature_list])
diabetes_yhat = diabetes_model(diabetes_x_tst)
preds = diabetes_model.predict(diabetes_test_ds)
from student_utils import get_mean_std_from_preds
m, s = get_mean_std_from_preds(diabetes_yhat)
prob_outputs = {
"pred": preds.flatten(),
"actual_value": d_test['time_in_hospital'].values,
"pred_mean": m.numpy().flatten(),
"pred_std": s.numpy().flatten()
}
prob_output_df = pd.DataFrame(prob_outputs)
prob_output_df.head()
Question 10: Given the output predictions, convert it to a binary label for whether the patient meets the time criteria or does not (HINT: use the mean prediction numpy array). The expected output is a numpy array with a 1 or 0 based off if the prediction meets or doesnt meet the criteria.
from student_utils import get_student_binary_prediction
student_binary_prediction = get_student_binary_prediction(prob_output_df, 'pred_mean')
Using the student_binary_prediction output that is a numpy array with binary labels, we can use this to add to a dataframe to better visualize and also to prepare the data for the Aequitas toolkit. The Aequitas toolkit requires that the predictions be mapped to a binary label for the predictions (called 'score' field) and the actual value (called 'label_value').
def add_pred_to_test(test_df, pred_np, demo_col_list):
for c in demo_col_list:
test_df[c] = test_df[c].astype(str)
test_df['score'] = pred_np
test_df['label_value'] = test_df['time_in_hospital'].apply(lambda x: 1 if x >=5 else 0)
return test_df
pred_test_df = add_pred_to_test(d_test, student_binary_prediction, ['race', 'gender'])
pred_test_df[['patient_nbr', 'gender', 'race', 'time_in_hospital', 'score', 'label_value']].head()
from sklearn.metrics import confusion_matrix, roc_auc_score, f1_score
import seaborn as sns
Question 11: Now it is time to use the newly created binary labels in the 'pred_test_df' dataframe to evaluate the model with some common classification metrics. Please create a report summary of the performance of the model and be sure to give the ROC AUC, F1 score(weighted), class precision and recall scores.
For the report please be sure to include the following three parts:
With a non-technical audience in mind, explain the precision-recall tradeoff in regard to how you have optimized your model.
What are some areas of improvement for future iterations?
# calculate the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true = pred_test_df['label_value'], y_pred = pred_test_df['score']).ravel()
# re order the matrx
cf_matrix = np.array([[tp, fp],[fn, tn]])
# plot
plt.figure(figsize=[13,8])
group_names = ['True Pos', 'False Pos', 'False Neg', 'True Neg']
group_counts = ["{0:0.0f}".format(value) for value in cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.set(font_scale=1.6)
sns.heatmap(pd.DataFrame(cf_matrix, columns=['Real >= 5', 'Real < 5'], index=['Pred >= 5', 'Pred < 5']), annot=labels, fmt="", cmap='Blues');
roc_auc_score(pred_test_df['label_value'], pred_test_df['score'])
The F1 score is the harmonic mean of the precision and recall
$$F_{1} = \frac{tp}{tp + \left ( tp+fn \right )*0.5}$$The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, if either the precision or the recall is zero. The F1 score is also known as the Sørensen–Dice coefficient or Dice similarity coefficient (DSC)
f1_score(pred_test_df['label_value'], pred_test_df['score'], average='weighted')
How well does the model classify?
E.g. if the model predicts 100 times True, how many times was it actually correct.
If this is a high number, then you can trust the model as soon as it predicts a true value.
This does not mean that the model always predicts True.
It may be that the model very often predicts (incorrectly) False,
but as soon as we observe a True value then we are pretty sure that it is indeed true.
tp / (tp + fp)
How many people with a true label, have we been able to classify correctly?
tp / (tp + fn)
The idea behind the precision-recall trade-off is that if you want to focus on a higher precision
(this by raising or lowering the threshold) you will lower the recall. (vice versa for the recall)
E.g.
Threshold = 0.5
Real Value = [ 0, 1, 1, 1 0, 0, 1, 0, 1, 1 ]
Predicted Value = [ 0.1, 0.45, 0.75, 0.6, 0.2, 0.4, 0.45, 0.33, 0.22, 0.9]
Predicted class = [ 0, 0, 1, 1 0, 0, 0, 0, 0, 1 ]
TP = 3
FP = 0
FN = 3
--> precision : 3 / ( 3 + 0 ) = 100 %
recall : 3 / ( 3 + 3 ) = 50 %
from the above example, we notice that our fictional model has a precision of 100%.
(If it predicts, True --> then the real value was indeed True)
But if we ask the question, how many Real True values did it predict correctly, then the answer is only 50%
Threshold = 0.3
Real Value = [ 0, 1, 1, 1 0, 0, 1, 0, 1, 1 ]
Predicted Value = [ 0.1, 0.45, 0.75, 0.6, 0.2, 0.4, 0.45, 0.33, 0.22, 0.9]
Predicted class = [ 0, 1, 1, 1 0, 1, 1, 1, 0, 1 ]
TP = 5
FP = 2
FN = 1
--> precision : 5 / ( 5 + 2 ) = 71 %
recall : 5 / ( 5 + 1 ) = 83 %
If lowering the threshold to 0.3.
Then the model will classify a sample much much much faster as True.
(which reduces the number of FNs !)
(because a FN, is the number of positive samples which were faulty classified as Negative)
--> And thus, the Recall will increase.
But when the Recall increases, the Precision will decrease. Becaues, we'll now generate more FPs.
Using the gender and race fields, we will prepare the data for the Aequitas Toolkit.
# Aequitas
from aequitas.preprocessing import preprocess_input_df
from aequitas.group import Group
from aequitas.plotting import Plot
from aequitas.bias import Bias
from aequitas.fairness import Fairness
ae_subset_df = pred_test_df[['race', 'gender', 'score', 'label_value']]
ae_df, _ = preprocess_input_df(ae_subset_df)
g = Group()
xtab, _ = g.get_crosstabs(ae_df)
absolute_metrics = g.list_absolute_metrics(xtab)
clean_xtab = xtab.fillna(-1)
aqp = Plot()
b = Bias()
Below we have chosen the reference group for our analysis but feel free to select another one.
# test reference group with Caucasian Male
bdf = b.get_disparity_predefined_groups(clean_xtab,
original_df=ae_df,
ref_groups_dict={'race':'Caucasian', 'gender':'Male'
},
alpha=0.05,
check_significance=False)
f = Fairness()
fdf = f.get_group_value_fairness(bdf)
Question 12: For the gender and race fields, please plot two metrics that are important for patient selection below and state whether there is a significant bias in your model across any of the groups along with justification for your statement.
# Plot two metrics
# Is there significant bias in your model for either race or gender?
absolute_metrics = g.list_absolute_metrics(xtab)
xtab[[col for col in xtab.columns if col not in absolute_metrics]]
xtab[['attribute_name', 'attribute_value'] + absolute_metrics].round(2)
_ = aqp.plot_group_metric(xtab, group_metric= 'tpr')
_ = aqp.plot_group_metric(xtab, group_metric= 'tnr')
_ = aqp.plot_group_metric(xtab, group_metric= 'precision')
_ = aqp.plot_group_metric(xtab, group_metric= 'npv')
_ = aqp.plot_group_metric(xtab, group_metric= 'ppr')
Answer:
REAL
TP | FP
------- <-- predicted
FN | TN
- TPR: recall -> TP / (TP + FN) # How many real positive values where classified positive
- TNR: TN / (FP + TN) # How many real negative values where classified negative
- Precision: TP / (TP + FP) # How many Positive predictions where indeed correctly classified as postitive
- NPV: TN / (TN + FN) # How many Negative predictions where indeed correctly classified as Negative
From the above image and tables, we see no direct bias towards Males and Females (Gender).
Nor do we see a clear bias towards race for the next metrics: TPR, TNR, PPV (precision), NPV
However, we do see a large deviation towards the PPR (Predicted Positive Ratio 𝑘 Disparity) metric.
This might be a point which we need to lt we need to look at more closely in the future.
Question 13: Earlier we defined our reference group and then calculated disparity metrics relative to this grouping. Please provide a visualization of the fairness evaluation for this reference group and analyze whether there is disparity.
fpr_disparity = aqp.plot_disparity(bdf, group_metric='fpr_disparity', attribute_name='race')
Answer: All the race groups have a +/- similar chance of being classified as a False Postitive.
other, followed by the Hispanics have the lowest chance to be classified as a False Positive (resp. 0,3 and 0,64)
The probability that a positive prediction is wrong, is almost 1.5 - 3 x lower.
(Bear in mind that, Other and Hispanics do also represent the least number of patients)